Introduction

We want to establish whether there is a correlation between the order in which a horse breezes and its breeze-up performance. We also want to examine whether breeze-up performance is indicative of future performance.

Data and Environment

The following variables were collected for horses sold at breeze-up sales from 2017 to 2020:

Analysis will be carried out in R relying heavily on the following packages:

Other packages are also used; these are all listed in libraries.R.

Glossary of Terms

  1. pctSeqBreeze <- The position of the horse in the breeze order, as a percentage relative to that particular sale.
  2. SeqBucket <- The bucket assigned to a horse based on its breeze order position.
  3. Time2f.Vmed <- The 2 furlong time of a horse, in seconds, relative to the median time for that breeze.
  4. ORFmax <- The maximum official rating that a horse has achieved following the sale, where O = Official Rating, RF = sourced from Raceform and max = the highest rating achieved.
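
As a rough illustration of how the glossary variables relate to the raw data, the sketch below derives pctSeqBreeze, SeqBucket and Time2f.Vmed for a small synthetic data set. The column names sale, lot_order and time2f, the three equal-width buckets and the use of the sale as the grouping for the median are assumptions made for this example; the study's actual derivation may differ.

```r
# Synthetic example data. The column names sale, lot_order and time2f are
# assumptions for illustration, not the study's actual schema.
set.seed(1)
horses <- data.frame(
  sale      = rep(c("A", "B"), times = c(6, 9)),
  lot_order = c(1:6, 1:9),
  time2f    = round(rnorm(15, mean = 23, sd = 0.5), 2)
)

# pctSeqBreeze: position as a percentage of the highest position at that sale
horses$pctSeqBreeze <- 100 * horses$lot_order /
  ave(horses$lot_order, horses$sale, FUN = max)

# SeqBucket: three equal-width segments of the breeze order (1 = earliest).
# Equal-width breaks are an assumption; the study may bucket differently.
horses$SeqBucket <- cut(horses$pctSeqBreeze,
                        breaks = c(0, 100 / 3, 200 / 3, 100),
                        labels = FALSE)

# Time2f.Vmed: 2 furlong time minus the median time at that sale (seconds)
horses$Time2f.Vmed <- horses$time2f -
  ave(horses$time2f, horses$sale, FUN = median)

print(horses)
```

Expressing position as a percentage of the last position keeps sales of different sizes comparable, which is the reason given later for using a percentage rather than the raw position.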

Index

I have also conducted a small amount of further research in the postPresentation.Rmd file. This looks into a few of the ideas alluded to in Part 6.

Part 1: Is there a correlation between Breeze Order and Breeze Time?

We want to find out whether there is any correlation between a horse’s position in the breeze order and the time in which it runs the breeze. First, I generate a summary table. I have grouped the horses into three buckets according to whether they breeze in the first, second or third segment of the order, and computed statistics for each bucket. The statistics measure time as Time2f.Vmed, which is defined in the glossary. I have also included a histogram showing that the horses are evenly distributed across the breeze order.

BucketNo.  Min     Max    Mean    Median  IQR
1          -1.425  1.994  -0.039  -0.118  0.730
2          -1.393  1.994  0.005   -0.086  0.759
3          -1.462  1.992  0.109   0.044   0.840
Figure 1.1: A summary table showing time statistics for each order bucket with 1 being the earliest to Breeze and 3 being the latest.
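
A summary table of this shape can be produced in base R roughly as follows. This is a minimal sketch on synthetic data (the values are made up and the bucket sizes are arbitrary), not the code used to generate Figure 1.1.

```r
# Synthetic stand-in for the study data; the values are made up.
set.seed(42)
d <- data.frame(
  SeqBucket   = rep(1:3, each = 50),
  Time2f.Vmed = rnorm(150, mean = rep(c(-0.04, 0, 0.1), each = 50), sd = 0.6)
)

# Split the times by bucket, then compute one statistic per column
groups <- split(d$Time2f.Vmed, d$SeqBucket)
bucket_stats <- data.frame(
  BucketNo  = as.integer(names(groups)),
  Min       = sapply(groups, min),
  Max       = sapply(groups, max),
  Mean      = sapply(groups, mean),
  Median    = sapply(groups, median),
  IQR       = sapply(groups, IQR),
  row.names = NULL
)
print(round(bucket_stats, 3))
```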

Figure 1.2: A histogram showing the distribution of pctSeqBreeze, the percentage breeze order.

This shows that a roughly equal number of horses run in each portion of the breeze order, which is what we would hope for from our data.

From the mean and median statistics, it appears that horses that breeze earlier in the order record faster (lower) breeze times than those that run later. To test whether this difference is significant, I conduct t-tests on the time differences between the buckets. Since we have a hypothesis about the direction of the effect, the t-tests are one-tailed. I test the difference between bucket 1 and bucket 2, between bucket 1 and bucket 3, and finally between bucket 2 and bucket 3.

Buckets Tested       t Value  Degrees Of Freedom  p Value
Bucket1 and Bucket2  -1.740   2174.809            0.041
Bucket1 and Bucket3  -5.619   2168.079            0.000
Bucket2 and Bucket3  -3.938   2166.363            0.000
Figure 1.3: A table of Welch t-tests on the difference in 2 furlong time versus the median between order buckets.

We can see from the p values that there are significant differences in breeze times between the buckets: the probability of differences this large arising by chance is below our threshold of p = 0.05 in every case, though only marginally so (p = 0.041) for buckets 1 and 2. The p value is smallest for the difference between bucket 1 and bucket 3, as expected.
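
For reference, a one-tailed Welch t-test of this kind can be run with base R's t.test. The sketch below uses synthetic bucket data (means, spread and sample sizes are made up); var.equal = FALSE, which is the default, gives the Welch version with its fractional degrees of freedom, and alternative = "less" tests whether the earlier bucket has the lower mean time.

```r
# Synthetic Time2f.Vmed values for an early and a late bucket (made up)
set.seed(7)
bucket1 <- rnorm(200, mean = -0.04, sd = 0.6)  # early breezers
bucket3 <- rnorm(200, mean =  0.11, sd = 0.6)  # late breezers

# H1: mean(bucket1) < mean(bucket3); var.equal = FALSE gives Welch's test
res <- t.test(bucket1, bucket3, alternative = "less", var.equal = FALSE)

print(c(t  = unname(res$statistic),
        df = unname(res$parameter),
        p  = res$p.value))
```

Because the alternative is "less", the p value is the lower tail of the t distribution at the observed statistic, which is why negative t values in the table above correspond to small p values.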

I have also included a linear regression, which allows a more visual interpretation of the data. Here the data are not divided into buckets. The response is still the 2 furlong time versus the median 2 furlong time, in seconds.

Term                  Estimate    Standard Error  Regression Statistic  p Value
Intercept             -0.0511852  0.0194528       -2.631                0.00854713
Explanatory Variable  0.0009976   0.0002124       4.697                 0.00000275
Figure 1.4: A univariate linear regression plot and a data table containing the statistics for the linear regression with Breeze order as the explanatory variable and time vs the median as the response variable.
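
A regression table like the one in Figure 1.4 can be obtained from lm and summary. The sketch below fits the same kind of model to synthetic data (the simulated slope and noise level are made up), then uses the slope actually reported in Figure 1.4 to show what it implies across the whole breeze order.

```r
# Synthetic data: breeze position (0-100) and time versus the median (made up)
set.seed(3)
n <- 300
pctSeqBreeze <- runif(n, min = 0, max = 100)
Time2f.Vmed  <- -0.05 + 0.001 * pctSeqBreeze + rnorm(n, sd = 0.6)

# Univariate linear regression: time explained by breeze order
fit <- lm(Time2f.Vmed ~ pctSeqBreeze)
coef_table <- coef(summary(fit))  # estimate, standard error, t value, p value
print(coef_table)

# Scatter plot with the fitted line (uncomment in an interactive session)
# plot(pctSeqBreeze, Time2f.Vmed); abline(fit)

# Slope reported in Figure 1.4, in seconds per percentage point of order
reported_slope <- 0.0009976
gap_seconds <- reported_slope * 100  # first horse versus last horse
print(gap_seconds)
```

With the reported slope, the fitted difference between the first and the last horse in the order is about 0.1 seconds, which is small relative to the interquartile ranges of roughly 0.7 to 0.8 seconds in Figure 1.1.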

We have established that there is a correlation between breeze order and breeze time. There are a number of things that could explain this.

  1. Firstly, the ground worsens throughout the day as more and more horses run, and this will have a significant impact on the speed at which a horse is able to run 2 furlongs.
  2. Secondly, and of more interest to us: are consignors pushing for their best horses to run earlier in the day and, if so, can this be used to predict the future ability of a horse?

The rest of this study examines the ability of breeze order to predict the future ability of a horse, which leads us onto the next question.

Part 2: Is there a correlation between Breeze Order and max future rating?

Having shown that there is a significant correlation between the order in which a horse breezes and its breeze time, I will now look at whether there is a correlation between breeze order and future max rating. As in the last question, I start with a summary table to get a general idea of any correlation and the direction of the effect. We refer to max future rating as ORFmax, which is defined in the glossary. The following table aims to show whether breeze order has any correlation with ORFmax.

BucketNo.  Min  Max  Mean    Median  IQR
1          4    121  73.299  73.0    25.00
2          4    114  69.884  71.0    25.00
3          5    119  69.412  69.5    22.25
Figure 2.1: A table showing ORFmax statistics for each breeze order bucket, with 1 being the earliest to breeze and 3 being the latest.

It is also convenient to include a linear regression for a visual interpretation of the data and the direction of the trend, if any.

Term                  Estimate  Standard Error  Regression Statistic  p Value
Intercept             73.87072  0.8485          87.058                0.00000000
Explanatory Variable  -0.05978  0.0150          -3.987                0.00006951
Figure 2.2: A data table containing the statistics for a linear regression with breeze order as the explanatory variable and ORFmax as the response variable.

From the above results, there does appear to be a general trend that horses running later in the breeze have a lower future max rating. We can also see that this correlation is not as pronounced as that between breeze time and future max rating.

I will now conduct t-tests to determine whether the trend that we can see is significant.

Buckets Tested       t Value  Degrees Of Freedom  p Value
Bucket1 and Bucket2  3.216    1290.152            0.001
Bucket1 and Bucket3  3.710    1276.594            0.000
Bucket2 and Bucket3  0.436    1222.827            0.331
Figure 2.3: A table of Welch t-tests on the difference in ORFmax between order buckets.

The t-tests show a significant difference between the first bucket and each of the other two, but not between the second and third. This suggests that breeze order may have some value to us and is worth investigating further. However, there are a few adjustments that we can make to obtain a more informative measure. We do this in the next question, which involves generating an order score for each horse and using it in place of the raw percentage position.

Part 3: How do we generate a more representative breeze order relative to consignor sway?

The consignor determines the order in which their own horses breeze. It is interesting to look at whether there is any correlation between the size of a consignor and their average breezing position: if we adjust for this, we might get a more accurate measure of how the horses are ordered relative to their expected future ability.

Here I have defined a small consignor as one that has had fewer than 25 horses at sale since we began collecting order data. To account for the different number of horses that run at each breeze, I describe the breeze position as a percentage relative to the highest position at that sale.
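
The small-consignor flag and the comparison that follows can be sketched as below. The data are synthetic, the consignor names are hypothetical, and smallSeller is simply an illustrative column name. The threshold of 25 horses is the one stated above.

```r
# Synthetic sales data; consignor names and positions are made up
set.seed(11)
sales <- data.frame(
  consignor    = rep(c("Big A", "Big B", "Small C", "Small D"),
                     times = c(40, 30, 10, 8)),
  pctSeqBreeze = runif(88, min = 0, max = 100)
)

# Count horses per consignor and flag those with fewer than 25
n_sold <- table(sales$consignor)
sales$smallSeller <- as.vector(n_sold[as.character(sales$consignor)]) < 25

# Average breeze position for large (FALSE) and small (TRUE) sellers
print(aggregate(pctSeqBreeze ~ smallSeller, data = sales, FUN = mean))

# Welch t-test comparing the two groups (two-sided by default)
res <- t.test(pctSeqBreeze ~ smallSeller, data = sales)
print(res)
```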

Small Seller  Average Breeze Position
FALSE         47.250
TRUE          57.813
Figure 3.1: A data table comparing average breeze position for small sellers and large sellers.

Comparing the average breeze position of small and larger sellers, we see a clear difference: small sellers breeze later on average (57.813 versus 47.250).

Data Name                          t Value  Degrees of Freedom  p Value
smallSellerSeq and largeSellerSeq  9.353    1579.019            0
Figure 3.2: A t-test comparing the difference in breeze position for small and large sellers.

The t-test shows that this result is significant.

It would also be interesting to see if there are any particular consignors that stand out as having a more significant influence over position.

Seller                  Average Breeze Position  Total No. Sold
Aguiar Bloodstock       45.147                   34
Ardglas Stables         38.652                   46
Ballinahulla Stables    37.970                   33
Ballycullen Stables     52.133                   30
Bansha House Stables    60.378                   119
Bloodstock Connection   45.091                   33
Brown Island Stables    37.359                   78
Bushy Park Stables      60.481                   27
CAJ Stables             35.385                   39
Derryconnor Stud        55.040                   50
Egmont Stud             49.855                   76
Gaybrook Lodge          31.602                   83
Grove Stud              39.078                   90
Horse Park Stud         44.507                   69
Hyde Park Stud          56.339                   112
Katie Walsh             40.538                   39
Kilminfoyle House Stud  55.062                   32
Knockanglass Stables    58.452                   177
Knockgraffon Stables    61.588                   34
Lackendarra Stables     63.172                   29
Longway Stables         49.138                   87
Lynn Lodge Stud         46.417                   36
Malcolm Bastard         44.024                   41
Mayfield Stables        42.433                   90
Meadow View Stables     36.603                   68
Mocklershill            39.883                   274
Oak Farm Stables        34.806                   36
Oak Tree Farm           41.930                   43
Powerstown Stud         48.176                   68
Shanaville Stables      74.400                   30
Sherbourne Lodge        69.280                   82
Star Bloodstock         29.903                   62
TallyHo Stud            46.650                   180
Yeomanstown Stud        48.553                   47
Small Sellers           57.813                   898
Figure 3.3: A data table comparing small sellers to larger sellers

This table is too big at the moment, and it is not obvious to a reader what it is showing, but it does contain some interesting results. It helps to highlight that a few consignors clearly have horses breezing earlier than others.

  • I am interested in looking at why a few of them have particularly low values, and I wanted to show you this as you might find it interesting.

Figure 3.4: A histogram showing the distribution of our breeze score.

This plot shows the breeze score with a normal distribution centered on a score of 0 overlaid. The normal distribution is an acceptable fit.

Linear Regression adjusted for consignor bias

I have given a position score to each horse based on the average position for that consignor. All small consignors (those who have sold fewer than 25 horses) are grouped together when calculating this adjustment.
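
The exact formula for the position score is not given here, so the following is only one plausible reading: the score is the horse's pctSeqBreeze minus the mean pctSeqBreeze of its consignor, with all small consignors pooled into a single group. That reading is an assumption, chosen because it produces a score centered on 0 as described for Figure 3.4. The data and consignor names are synthetic.

```r
# Synthetic sales data; consignor names and positions are made up
set.seed(5)
sales <- data.frame(
  consignor    = rep(c("Big A", "Big B", "Small C", "Small D"),
                     times = c(60, 45, 12, 9)),
  pctSeqBreeze = runif(126, min = 0, max = 100)
)

# Pool every consignor with fewer than 25 horses into one group
n_sold <- table(sales$consignor)
is_small <- as.vector(n_sold[as.character(sales$consignor)]) < 25
sales$group <- ifelse(is_small, "Small Sellers", as.character(sales$consignor))

# ASSUMED definition: position relative to the consignor's average position
sales$orderScore <- sales$pctSeqBreeze - ave(sales$pctSeqBreeze, sales$group)

print(tapply(sales$orderScore, sales$group, mean))  # each is (numerically) 0
```

Under this definition a negative score means a horse breezed earlier than is typical for its consignor, regardless of where that consignor usually sits in the order.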

Term                  Estimate  Standard Error  Regression Statistic  p Value
Intercept             70.91491  0.43538         162.882               0.00000
Explanatory Variable  -0.03939  0.01603         -2.458                0.01405
Figure 3.5: A univariate linear regression plot and a data table containing the statistics for the linear regression with breeze order adjusted for consignor as the explanatory variable and ORFmax as the response variable.

The higher p value here (compared to the regression with pctSeqBreeze) suggests that orderScore is actually not as good a predictor of ORFmax as pctSeqBreeze. Therefore, for the rest of our analysis we will use pctSeqBreeze where necessary.

Part 4: Does breeze order give us any additional information?

The ultimate question that this study is asking is: if we already have the breeze time of a horse, does the breeze order give us any additional information for predicting the future value of the horse?

It is therefore important to try to separate order and breeze time. We have shown that they both have value in predicting the future value of a horse; we now need to test for multicollinearity to see whether they have value as individual variables.

Figure 4.1: A correlation matrix comparing time, absolute breeze order, adjusted breeze order and ORFmax.

Figure 4.2: A pairwise scatter plot

From the correlation matrix and the plot we see again that pctSeqBreeze is a better measure than orderScore for determining the future ability of a horse.

From the correlation matrix, the correlation between time and pctSeqBreeze also appears low enough that multicollinearity won’t be too much of an issue.

## [1] "VIF = 1.10597496056056"
Variance Inflation Factor

We usually only consider multicollinearity to be a problem when VIF > 10. We therefore assume that multicollinearity is not a problem here and move on to the next section.
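
For reference, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors in the model (not from the model for ORFmax itself). With only two predictors this reduces to 1 / (1 - r²), where r is their correlation. A minimal sketch on synthetic data (the simulated relationship is made up):

```r
# Synthetic predictors with a weak positive relationship (made up)
set.seed(9)
n <- 500
pctSeqBreeze <- runif(n, min = 0, max = 100)
Time2f.Vmed  <- 0.001 * pctSeqBreeze + rnorm(n, sd = 0.6)

# R-squared from regressing one predictor on the other
r2_predictors <- summary(lm(Time2f.Vmed ~ pctSeqBreeze))$r.squared

# Variance inflation factor for either predictor in a two-predictor model
vif <- 1 / (1 - r2_predictors)
print(paste("VIF =", vif))
```

As a cross-check on the value printed above, 1 / (1 - 0.0958) is approximately 1.106, which is what the formula gives when the R² of the ORFmax model in Figure 5.6 is plugged in. It may be worth confirming which R² was used, although either way the VIF is far below 10.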

Part 5: What is the best model for determining ORFmax?

As we have seen that pctSeqBreeze is a better measure than orderScore, we will use it for the rest of the analysis. Here we aim to determine whether pctSeqBreeze actually adds any value to our model: there is no point in adding a variable to a model if it doesn’t improve it. Is a model involving both time and pctSeqBreeze better than one that only involves time?

In our various models:

  • x1 - pctSeqBreeze
  • x2 - time

y = m1x1 + c

Term                  Estimate  Standard Error  Regression Statistic  p Value
Intercept             73.87072  0.8485          87.058                0.00000000
Explanatory Variable  -0.05978  0.0150          -3.987                0.00006951
Figure 5.1: A regression between predictor pctSeqBreeze and ORFmax.

Our p-value of 0.00007 suggests that this negative trend is significant and that pctSeqBreeze has value with respect to predicting ORFmax.

y = m2x2 + c

Term                  Estimate  Standard Error  Regression Statistic  p Value
Intercept             70.102    0.4197          167.05                0.000e+00
Explanatory Variable  -9.932    0.7134          -13.92                4.841e-42
Figure 5.2: A regression between predictor Time2f.Vmed and ORFmax.

Our p-value of 4.841e-42 suggests that this negative trend is significant and that Time2f.Vmed has value with respect to predicting ORFmax.

We have seen that both factors, when used on their own, have significant value in predicting the future ability of a horse. However, we want to know whether using them together provides a better model than using time on its own.

y = m1x1 + m2x2 + c

Figure 5.3: A series of plots to help determine the effectiveness of the model when modelling ORFmax from predictors pctSeqBreeze and Time2f.Vmed.

We can see from our residuals vs fitted plot that there is no evidence of the fan shape characteristic of heteroscedasticity, in which the variance of the residuals increases as the fitted values increase. This does not appear to be the case here.

The next plot is the QQ-plot. Though most of the points seem to fall on the line which indicates that our residuals come from a normal distribution, there are some points that stray from the line in the lower and upper quantiles of the plot. It is possible that these points do not come from a normal distribution, but most of our points seem to come from a normal distribution so there is not a lot to worry about here.

The third plot is the scale-location plot. This plot is similar to the residuals vs fitted plot, but uses the square root of the absolute standardized residuals instead of the residuals themselves. This makes trends in the spread of the residuals more evident.

Finally, we see the leverage plot. This plot graphs the standardized residuals against their leverage. It also includes the Cook’s distance boundaries. Any point outside of those boundaries would be an influential point, that is, one whose removal would noticeably change the fitted model. Since we cannot even see the boundaries on our plot, we can conclude that we have no unduly influential points.
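
The quantities behind these four plots can also be inspected numerically. The sketch below fits the two-predictor model to synthetic data (the coefficients and noise level are made up, loosely echoing the scale of the study) and extracts leverage, Cook's distance and the scale-location values with base R functions. The 0.5 cut-off mirrors the first Cook's distance contour that plot.lm draws by default.

```r
# Synthetic data on roughly the same scale as the study (values are made up)
set.seed(21)
n <- 400
pctSeqBreeze <- runif(n, min = 0, max = 100)
Time2f.Vmed  <- rnorm(n, sd = 0.6)
ORFmax <- 72 - 0.04 * pctSeqBreeze - 9.7 * Time2f.Vmed + rnorm(n, sd = 18)

fit <- lm(ORFmax ~ pctSeqBreeze + Time2f.Vmed)

# The four diagnostic plots discussed above (uncomment in an interactive session)
# par(mfrow = c(2, 2)); plot(fit)

lev       <- hatvalues(fit)              # leverage of each observation
cooks     <- cooks.distance(fit)         # influence of each observation
scale_loc <- sqrt(abs(rstandard(fit)))   # y axis of the scale-location plot

print(max(cooks))        # largest Cook's distance in the data
print(sum(cooks > 0.5))  # number of points beyond the first contour
```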

              Est.    S.E.   t val.   p
(Intercept)   72.083  0.821  87.799   0.000
pctSeqBreeze  -0.040  0.014  -2.806   0.005
Time2f.Vmed   -9.733  0.716  -13.601  0.000
Figure 5.4: The coefficient table for the multivariate regression.

              2.5 %    97.5 %
(Intercept)   70.473   73.693
pctSeqBreeze  -0.069   -0.012
Time2f.Vmed   -11.137  -8.330
Figure 5.5: The confidence intervals for the multivariate regression.

F(2, 1910)  R-sqd   Adj-R-sqd
101.2061    0.0958  0.0949
Figure 5.6: The model fit statistics for the multivariate regression.
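
One way to ask directly whether pctSeqBreeze adds anything to the time-only model is a partial F-test between the two nested models. The sketch below is an illustration on synthetic data (the simulated coefficients are made up) and is not necessarily the comparison carried out in modellingORFmax.Rmd.

```r
# Synthetic data (values are made up)
set.seed(2)
n <- 500
pctSeqBreeze <- runif(n, min = 0, max = 100)
Time2f.Vmed  <- 0.001 * pctSeqBreeze + rnorm(n, sd = 0.6)
ORFmax <- 72 - 0.04 * pctSeqBreeze - 9.7 * Time2f.Vmed + rnorm(n, sd = 18)

fit_time <- lm(ORFmax ~ Time2f.Vmed)                 # y = m2x2 + c
fit_both <- lm(ORFmax ~ Time2f.Vmed + pctSeqBreeze)  # y = m1x1 + m2x2 + c

# Partial F-test: does adding pctSeqBreeze significantly reduce the residual error?
cmp <- anova(fit_time, fit_both)
print(cmp)

# 95% confidence intervals and adjusted R-squared for comparison
print(confint(fit_both, level = 0.95))
print(c(time_only      = summary(fit_time)$adj.r.squared,
        time_and_order = summary(fit_both)$adj.r.squared))
```

When a single term is added, the F statistic from this comparison equals the square of that term's t value in the larger model, so the test agrees with the pctSeqBreeze row of the coefficient table.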

Refer to modellingORFmax.Rmd for some additional information on the best models. This suggests that the best model is the one that includes both pctSeqBreeze and Time2f.Vmed. However, even this model does not predict ORFmax as well as price does. There are clearly other factors that buyers can observe and are paying for.

Conclusion of Initial Study and Areas for Future Research

Conclusion

The research into the correlation between breeze order and breeze time must be interpreted with caution. There is clearly a lot of noise when looking into the correlation between the two variables. The statistical significance that we are seeing may just be due to our large number of data points. However, I don’t believe this first section to have no value, so have left it in.

  • In conclusion, we have found that breeze order can be used to improve our initial time-only model for predicting the ORFmax of a horse.
  • This means that a horse running the same breeze time as another, but later in the order, is likely to be a worse horse.
  • This hints that consignors are putting their better horses earlier in the order, and that their choice reflects more than just how fast the horses are capable of running the breeze.
  • It may be useful to add breeze order to our current model for predicting ORFmax. Before doing this, we must check that pctSeqBreeze is still significant when tested in combination with our current model.
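
To put a number on the second point, the coefficients reported in Figure 5.4 can be plugged into the fitted equation for two hypothetical horses that run exactly the median time, one breezing at 10% of the way through the order and one at 90%. The two positions are arbitrary choices for illustration.

```r
# Coefficients as reported in Figure 5.4
b0      <- 72.083
b_order <- -0.040
b_time  <- -9.733

# Fitted ORFmax for a given order position and time versus the median
predict_orfmax <- function(pctSeqBreeze, Time2f.Vmed) {
  b0 + b_order * pctSeqBreeze + b_time * Time2f.Vmed
}

early <- predict_orfmax(pctSeqBreeze = 10, Time2f.Vmed = 0)  # hypothetical horse
late  <- predict_orfmax(pctSeqBreeze = 90, Time2f.Vmed = 0)  # hypothetical horse
print(early - late)

# Same gap using the ends of the 95% interval from Figure 5.5
print(c(-0.012, -0.069) * (10 - 90))
```

The fitted gap is 3.2 rating points, and the confidence interval for the coefficient puts it between roughly 1 and 5.5 points. Given an R-sqd of 0.0958, this is a shift in the average rather than a reliable prediction for any individual horse.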

Future Research

In the future it will be interesting to see how our findings relate to the prices that people end up paying for particular horses. It could be the case that people are placing too much emphasis on the Breeze Time of a horse and actually paying more for a horse than it is worth, purely due to this. We will need to somehow model the future value of a horse using ORFmax. I attempt to do this in postPresentation.Rmd.

It will also be useful to look at a similar question about breeze order but on a consignor level. An individual consignor gets allotted spots in the order and then gets to pick which horse of theirs they place in each of these spots. Therefore, a consignor has complete control over the order of their own horses. If it is the case that a particular consignor always places their horses in the order that they think is representative of their future ability, then this could be significant to us in predicting the future value of a horse. However, it is likely that different consignors have different tactics and that we only have enough data to elucidate the strategies of the largest consignors. Smaller consignors may not have sold enough horses for us to identify their strategies.

We are also provided with a good basis for spotting edge cases. If a large consignor has placed a horse early in their order but it then runs a slow breeze time, we might be inclined to underestimate the value of the horse if we do not take into account the fact that the consignor has chosen to place the horse early in their order. Does the consignor know something about the future ability of this particular horse that the breeze time does not? We may therefore be able to get good value on a horse that runs a slow breeze time by using the intuition of the consignor to uncover variables unknown to us.

Supplementary Part 1: How is price correlated with order and time?

As the general trend is that consignors seem to place their better horses earlier in the order it is interesting to look into whether this is financially beneficial. The first thing that I will look into is whether people pay more money for horses that are better.

Term                  Estimate  Standard Error  Regression Statistic  p Value
Intercept             8.73564   0.090860        96.14                 0.000e+00
Explanatory Variable  0.02429   0.001231        19.73                 2.098e-78
Figure 6.1: The regression table with ORFmax as the predictor and Price.GBP as the response.

This clearly shows that there is a trend between the price that is paid for a horse and its future max rating. Next we want to try to determine some of the factors that result in paying more for the horse.

We will first look at time vs price.

Term                  Estimate  Standard Error  Regression Statistic  p Value
Intercept             10.2853   0.01908         539.15                0.000e+00
Explanatory Variable  -0.7887   0.03138         -25.13                1.426e-126

This shows that there is a strong correlation between time and price. The faster the time, the higher the price. Interestingly, there is a stronger correlation between time and price than there is between time and future max performance. Does this indicate that non-time factors are more important in getting a good bargain?

Order vs Price

Term                  Estimate   Standard Error  Regression Statistic  p Value
Intercept             10.639185  0.041364        257.21                0.000e+00
Explanatory Variable  -0.007258  0.000718        -10.11                1.208e-23

This shows that there is a correlation between order and price. The later in the order a horse breezes, the lower the price.

Is there a better model if we use both order and time?

              Est.    S.E.   t val.   p
(Intercept)   10.569  0.038  279.227  0
pctSeqBreeze  -0.006  0.001  -8.651   0
Time2f.Vmed   -0.763  0.031  -24.483  0

              2.5 %   97.5 %
(Intercept)   10.495  10.643
pctSeqBreeze  -0.007  -0.004
Time2f.Vmed   -0.824  -0.702

F(2, 2976)  R-sqd   Adj-R-sqd
361.0809    0.1953  0.1947

We see that both time and order are significant in this model.
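
The scale of the price variable is not stated in this section. The intercepts of around 10.3 to 10.6 would be consistent with the natural log of a price in GBP, but that is an assumption on my part and should be checked against the code. If it holds, the coefficients above can be read as approximate percentage effects, as sketched below using the reported estimates.

```r
# Coefficients as reported in the combined order-and-time price model above
b_order <- -0.006
b_time  <- -0.763

# ASSUMPTION: the response is the natural log of price, so a change of d in a
# predictor multiplies the fitted price by exp(coefficient * d).
ratio_faster <- exp(b_time * -0.1)  # 0.1 s faster, same position in the order
ratio_later  <- exp(b_order * 10)   # 10 percentage points later, same time

print(round(c(faster_by_0.1s = ratio_faster, later_by_10pct = ratio_later), 3))
```

Under that assumption, a horse that is 0.1 seconds faster would be fitted at about 8% more, and a horse that breezes 10 percentage points later in the order at about 6% less.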